Figure 1. Calibration of confidence judgments for forecasts regarding
possible outcomes of President Nixon's trips to China and the USSR.
Source: Fischhoff & Beyth (1975).

Figure 2. Representative calibration curves derived from studies using
two-alternative, half-range tasks. Source: Lichtenstein & Fischhoff
(1977).

Figure 3. Calibration curves for individuals providing supporting,
contradicting, or supporting and contradicting reasons. Each group's
calibration is compared with their own performance on a set of control
items. Source: Koriat, Lichtenstein & Fischhoff (1980).

Figure 4. Calibration of all responses to control items in Study 1.
Curve includes 3,447 responses produced by 112 individuals.

Figure 5. Calibration curves for individuals providing supporting,
contradicting or both kinds of reasons in Study 1. Corresponding
summary statistics appear in Table 2.

Figure 6. Calibration curves for users and non-users of 1.0, pooled
across Studies 1-3. Corresponding summary statistics are given in
Table 6. Curves involve approximately 5,000 to 16,000 responses produced
by 100 to 300 individuals.
SUMMARY


Forecasts  have  little  value  to  decision  makers  unless  it  is  known  how 
much  confidence  to  place  in  them.  Those  expressions  of  confidence  have, 
in  turn,  little  value  unless  forecasters  are  able  to  assess  the  limits 
of  their  own  knowledge  accurately. 

Previous research has shown very robust patterns in the judgments of
individuals who have not received special training in confidence
assessment: Knowledge generally increases as confidence increases. However,
confidence increases too swiftly, with a doubling of confidence being
associated with perhaps a 50% increase in knowledge. With all but the
easiest of tasks, people tend to be overconfident regarding how much they know.

These  results  have  typically  been  derived  from  studies  of  judgments  of 
general  knowledge.  The  present  study  found  that  they  also  pertained  to 
confidence  in  forecasts.  Indeed,  the  confidence-knowledge  curves  observed 
here  were  strikingly  similar  to  those  observed  previously.  The  only 
deviation  was  the  discovery  that  a  substantial  minority  of  judges  never 
expressed  complete  confidence  in  any  of  their  forecasts.  These 
individuals  also  proved  to  be  better  assessors  of  the  extent  of  their  own 
knowledge. 

Apparently  confidence  in  forecasts  is  determined  by  processes  similar  to 
those  that  determine  confidence  in  general  knowledge.  Decision  makers 
can  use  forecasters'  assessments  in  a  relative  sense,  in  order  to  predict 
when  they  are  more  and  less  likely  to  be  correct.  However,  they  should 
be  hesitant  to  take  confidence  assessments  literally.  Someone  is  more 
likely  to  be  right  when  he  or  she  is  "certain"  than  when  he  or  she  is 
"fairly  confident;"  but  there  is  no  guarantee  that  the  certain  forecast 
will  come  true. 
SUBJECTIVE CONFIDENCE IN FORECASTS

Since  the  destruction  of  the  Second  Temple, 

prophecy  has  become  the  lot  of  fools. 

-  Hebrew  expression 

What constitutes a wise forecaster? It is not just someone who is usually
correct; that definition would give undue deference to those who make
forecasts about predictable events. It is not just someone who is seldom
proven wrong; that definition would reward the makers of vague and
unverifiable forecasts. It is not just someone who provides a confident message
with clear implications for action; that definition would promote arrogance
over thoughtfulness.

If one is to take action on the basis of a forecast, perhaps the most
desirable property is that it be appropriately qualified. That is, one
wants to know how much faith to put in it. One measure of the
appropriateness of expressions of faith in forecasts is their degree of calibration.1
For the sake of calibration, all statements of fact are considered to carry
with them an implicit or explicit expression of confidence in their truth.
When that expression is given quantitative form, the archetypal statement
of fact has the form "the probability that statement A is true is X."
Statement A may refer to a discrete event (My bank account is overdrawn.)
or a continuous one (The balance in my bank account is between -$100 and
$150.). It could refer to the past (George Washington died because of
poor medical treatment.), present (The capital of Saudi Arabia is Mecca.),
or future (Quebec will be a part of Canada on January 1, 2000.). Only
statements about the future represent forecasts, but the evaluation of all
such expressions of confidence is similar. Except for situations in which
an individual is 100% confident and wrong, it is hard to validate single
expressions. However, one can take a set of statements and see if X% of
those assigned an X% chance of being correct prove to be correct, once the
truth of the statement can be ascertained. The truth of forecasts can be
checked by seeing whether the predicted events occur.

The  Bayesian,  or  subjectivist,  view  of  probability  underlying  calibration 
studies assumes that probabilities represent an individual's state of
knowledge.  Hence,  it  makes  sense  to  aggregate  probabilities  over  a 
diverse  set  of  statements  and  see  how  well,  in  general,  an  individual 
assesses  the  extent  of  his  or  her  knowledge. 

Crude  retrospective  assessments  of  calibration  may  be  derived  from  looking 
at  the  confidence  expressions  accompanying  the  performance  of  real  tasks. 
Thus,  one  might  find  evidence  of  overconfidence  in  professions  that  make 
confident  judgments  with  no  demonstrated  validity  (e.g.,  predictions  of 
stock  price  movements  [Dreman,  1979;  Slovic,  1972],  psychiatric  diagnoses 
of dangerousness [Cocozza & Steadman, 1978]). Unfortunately, such evidence
is not only imprecise, but also ambiguous whenever "experts" are consulted
(and paid) as a function of the confidence they inspire, suggesting that
they  may  be  tempted  to  misrepresent  how  much  they  know  (Armstrong,  1978). 

Among  real-world  studies,  the  greatest  efforts  to  ensure  candor  and 
explicitness  have  been  with  weather  forecasters,  who  are  rewarded  for  good 
calibration. Their performance is superb (e.g., Murphy & Winkler, 1974,
1977). Whether this success is due to training in calibration or a
by-product of their general professional education is unclear. A review of
other studies with experts who have not had calibration training suggests
that such training, and not just education, is the effective element.
Experiments, using problems drawn from their respective areas of expertise
but isolated from real-world pressures, have found overconfidence with
psychology graduate students (Lichtenstein & Fischhoff, 1977), bankers
(Staël von Holstein, 1972), clinical psychologists (Oskamp, 1962),
executives (Moore, 1977), civil engineers (Hynes & Vanmarcke, 1976), and
untrained professional weather forecasters (Root, 1962; Staël von Holstein,
1971).


Overconfidence is also the predominant result of experiments using
non-experts responding to general-knowledge questions (Lichtenstein, Fischhoff,
& Phillips, in press). Table 1 provides a summary of studies that have
attempted to eradicate overconfidence by a variety of manipulations,
including changing the response mode, offering detailed instructions,
raising the stakes hinging on good calibration, and varying the
heterogeneity of the items being judged. Each paper is represented by a number,
which is underlined if the manipulation seemed to improve calibration.

From  this  large  set  of  studies,  only  three  procedures  seem  to  be  effective. 
One  is  extensive  training  with  personalized  feedback.  The  second  is 
forcing  respondents  to  list  reasons  why  the  statement  or  answer  they  believe 
in might be wrong (Koriat, Lichtenstein, & Fischhoff, 1980; Study No. 17 in
Table  1).  The  third,  and  least  interesting,  is  to  provide  easier  tasks. 

One  reflection  of  people's  insensitivity  to  how  much  they  know  is  the  fact 
that  their  mean  confidence  changes  relatively  slowly  in  response  to  changes 
in  the  difficulty  of  the  tasks  they  face.  Thus,  when  tasks  become  easier, 
people's confidence does not rise commensurately, leaving them
underconfident for the easiest of tasks. In this light, the preponderance of
overconfidence  in  the  literature  reflects,  in  part,  the  (perhaps  natural) 
tendency  not  to  present  people  with  very  easy  questions. 

The  subjectivist  interpretation  of  probability  makes  no  distinction  between 
confidence  in  statements  about  the  future  and  confidence  in  any  other  kind 
of  statement.  Hence,  from  a  formal  perspective,  one  would  expect  that  the 
results summarized in Table 1 could be generalized to the calibration of
forecast  probabilities.  That  is,  one  could  expect  to  find  overconfidence 
that  is  impervious  to  most  of  the  various  manipulations  described  there. 
Formal  equivalence  is  not,  however,  the  same  as  psychological  equivalence. 
One  might  speculate,  for  example,  that  all  other  things  being  equal, 
people  are  less  confident  in  their  knowledge  about  the  future  because  no 
one  knows  about  the  future.  Or  one  might  speculate  that  they  are  more 
confident  because  no  one  can  prove  them  wrong  at  the  moment  of  prediction. 
TABLE 1

ATTEMPTS TO EXPLAIN OR REDUCE OVERCONFIDENCE

Strategies                                  Studied by

Faulty tasks
  Unfair tasks
    Raise stakes                            1, 28
    Clarify instructions/stimuli            3, 10, 12, 13, 20
    Discourage second guessing              12, 20
    Use better response modes               12, 13, 19, 21, 22, 30, 32, 33?, 34, 37?
    Ask fewer questions                     15
  Misunderstood tasks
    Demonstrate alternative goal            13
    Demonstrate semantic disagreement       3, 13, 18, 28?
    Demonstrate impossibility of task       12
    Demonstrate overlooked distinction      14?

Faulty judges
  Perfectible individuals
    Warn of problem                         12
    Describe problem                        3
    Provide personalized feedback           20
    Train extensively                       1, 2, 4, 16, 20, 24, 25, 29, 32
  Incorrigible individuals
    Replace them                            -
    Recalibrate their responses             2, 5, 23
    Plan on error                           -

Mismatch between judges and task
  Restructuring
    Make knowledge explicit                 17
    Search for discrepant information       17
    Decompose problem                       -
    Consider alternative situations         -
    Offer alternative formulations          33?
  Education
    Rely on substantive experts             11, 15, 19, 23, 27, 31, 35, 36
    Use easier questions                    8, 9, 22, 26, 29, 30
    Educate from childhood                  6, 7

Note: Each number represents a separate article. Manipulations that have proven at
least partially successful are underlined. Those that have yet to be subjected
to empirical test or for which the evidence is unclear are marked by a question mark.
Details in Fischhoff (in press).
KEY TO TABLE 1

1. Adams & Adams (1958)
2. Adams & Adams (1961)
3. Alpert & Raiffa (in press)
4. Armelius (1979)
5. Becker & Greenberg (1978)
6. Beyth-Marom & Dekel (in press)
7. Cavanaugh & Borkowski (1980)
8. Clarke (1960)
9. Cocozza & Steadman (1978)
10. Dawes (in press)
11. Dowie (1976)
12. Fischhoff & Slovic (1980)
13. Fischhoff, Slovic & Lichtenstein (1977)
14. Howell & Burnett (1978)
15. Hynes & Vanmarcke (1976)
16. King, Zechmeister & Shaughnessy (in press)
17. Koriat, Lichtenstein & Fischhoff (1980)
18. Larson & Reenan (1979)
19. Lichtenstein & Fischhoff (1977)
20. Lichtenstein & Fischhoff (1980)
21. Lichtenstein, Fischhoff & Phillips (in press)
22. Ludke, Stauss & Gustafson (1977)
23. Moore (1977)
24. Murphy & Winkler (1974)
25. Murphy & Winkler (1977)
26. Nickerson & McGoldrick (1965)
27. Oskamp (1962)
28. Phillips & Wright (1977)
29. Pickhardt & Wallace (1974)
30. Pitz (1974)
31. Root (1962)
32. Schaefer & Borcherding (1973)
33. Seaver, von Winterfeldt & Edwards (1978)
34. Selvidge (1980)
35. Staël von Holstein (1971)
36. Staël von Holstein (1972)
37. Tversky & Kahneman (1974)
A study by Fischhoff (1976) found no difference in judgments of the
likelihood of hypothetical events set in the future, present, or past. However,
the hypotheticality of those events may have weakened some pertinent
psychological processes. The studies involving predictions cited above
(e.g., Murphy & Winkler, 1977; Root, 1962) also follow the general
patterns observed in non-prediction studies (i.e., overconfidence except
with easy tasks or extensive, personalized training). One intriguing
possible exception to these patterns is seen in Figure 1, showing a study
by Fischhoff and Beyth (1975) in which participants assessed the
probability of various possible outcomes of President Nixon's trips to China and the
USSR (e.g., he will meet with Chairperson Mao). At the extremes here, one
sees  the  usual  overconfidence.  About  10%  of  the  events  that  respondents 
were  100%  certain  would  happen,  failed  to  happen;  about  10%  of  those  that 
had  0%  chance  of  happening  did  happen.  Nonetheless,  over  most  of  the 
range,  subjects  were  quite  well  calibrated.  An  unpublished  study  by 
Wright  and  Wisudha  (1979)  showed  less  overconfidence  with  forecasts  than 
with assessment of general-knowledge questions; unfortunately, the forecast
questions  were  also  less  difficult,  suggesting  that  ease  might  have  been 
responsible  for  the  difference  in  calibration. 

Reviewing  this  evidence,  anyone  interested  in  eliciting  and  interpreting 
expressions  of  confidence  in  forecasts  or  in  training  forecasters  to  make 
such assessments is probably best off assuming that probability
assessments for forecasts are no different than those for other problems. The
present study attempts to increase or decrease the confidence in forecasts
using tasks that are as similar as possible to those used in studies of
calibration with general-knowledge questions.

FIGURE 1.

CALIBRATION OF CONFIDENCE JUDGMENTS FOR FORECASTS REGARDING POSSIBLE
OUTCOMES OF PRESIDENT NIXON'S TRIPS TO CHINA AND THE USSR.
(Axis label: Subjects' Response.)

SOURCE: FISCHHOFF & BEYTH (1975)

Study 1.

The most widely-used task in calibration studies is the half-range
two-alternative question. Given an item with two alternative answers, one of
which is guaranteed to be true (e.g., absinthe is (a) a precious stone;
(b) a liqueur), the respondent must first select the answer that seems more
likely to be correct and then assess the probability of that choice being
the correct one. Because the more likely answer was to have been chosen,
that probability should come from the upper half of the probability range:
[.5, 1.0]. Figure 2 shows some typical results observed in studies using
such tasks. With all but the easiest tasks, one finds overconfidence,
represented by calibration curves resting predominantly under the identity
line that would reflect perfect calibration. Being under the identity line
means that the percentage of correct answers associated with a particular
expressed probability of being correct is smaller than that probability.

In such figures, responses are grouped into the intervals [.50, .59],
[.60, .69], [.70, .79], [.80, .89], [.90, .99], and [1.00].
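
The grouping just described can be illustrated with a minimal sketch. It is not the procedure used in the studies cited; it only assumes that each response is recorded as a pair of stated probability and correctness. The names bin_index and calibration_curve, and the sample responses, are hypothetical and introduced solely for illustration.

```python
from collections import defaultdict

# Lower bounds of the intervals named in the text:
# [.50,.59], [.60,.69], [.70,.79], [.80,.89], [.90,.99], [1.00].
BIN_LOWER_BOUNDS = (0.5, 0.6, 0.7, 0.8, 0.9, 1.0)


def bin_index(confidence):
    """Return the index of the interval containing a half-range response."""
    if not 0.5 <= confidence <= 1.0:
        raise ValueError("half-range responses must lie in [.5, 1.0]")
    # Work in integer hundredths to avoid floating-point edge effects.
    hundredths = round(confidence * 100)
    return min((hundredths - 50) // 10, 5)


def calibration_curve(responses):
    """Group (confidence, was_correct) pairs into the six intervals.

    Returns {interval_lower_bound: (mean_confidence, proportion_correct, n)}
    for every interval that received at least one response.
    """
    groups = defaultdict(list)
    for confidence, was_correct in responses:
        groups[bin_index(confidence)].append((confidence, bool(was_correct)))

    curve = {}
    for index in sorted(groups):
        items = groups[index]
        n = len(items)
        mean_confidence = sum(c for c, _ in items) / n
        proportion_correct = sum(ok for _, ok in items) / n
        curve[BIN_LOWER_BOUNDS[index]] = (mean_confidence, proportion_correct, n)
    return curve


# Hypothetical responses, invented for illustration only.
sample = [(0.5, True), (0.5, False), (0.7, True), (0.75, False),
          (0.9, True), (1.0, True), (1.0, False)]
for lower, (mean_conf, prop, n) in calibration_curve(sample).items():
    print(f"{lower:.1f}: mean confidence {mean_conf:.3f}, "
          f"proportion correct {prop:.3f}, n = {n}")
```

A point lies under the identity line whenever its proportion correct is smaller than its mean confidence, which is the pattern the text identifies with overconfidence.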

The one notable exception to this pattern was the study by Koriat,
Lichtenstein, and Fischhoff (1980) in which overconfidence was reduced (although
not altogether eliminated) by having respondents provide a reason why each
of their chosen answers might be incorrect. Figure 3 shows the effect of
this contradicting reason manipulation along with the non-effect of two
related manipulations. (In the exhibit, each group's performance on the
experimental task is contrasted with its performance on a set of control
items for which no reasons were given.) The supporting-reason group
provided one reason why their chosen answer might be correct; the
both-reasons group gave one supporting and one contradicting reason. The
absence of an effect with those groups indicated that the contradicting
reason group's calibration had not improved simply as a result of the
additional labor involved in writing a reason.

Study  1  replicates  the  three  conditions  of  Koriat,  Lichtenstein,  and 
Fischhoff  (1980),  using  items  involving  future  events. 
FIGURE 2.

REPRESENTATIVE CALIBRATION CURVES DERIVED FROM STUDIES USING TWO-ALTERNATIVE
HALF-RANGE TASKS.

SOURCE: LICHTENSTEIN & FISCHHOFF (1977).

FIGURE 3.

CALIBRATION CURVES FOR INDIVIDUALS PROVIDING SUPPORTING, CONTRADICTING, OR
SUPPORTING AND CONTRADICTING REASONS. EACH GROUP'S CALIBRATION IS COMPARED
WITH THEIR OWN PERFORMANCE ON A SET OF CONTROL ITEMS.

SOURCE: KORIAT, LICHTENSTEIN & FISCHHOFF (1980)


Method 


Design.  Each  participant  responded  to  50  two-alternative  half-range 
questions,  picking  the  answer  most  likely  to  be  correct  in  each  and  then 
assigning  it  a  probability  (from  .5  to  1.0)  of  being  correct.  The  first 
25  items  were  done  using  standard  assessment  techniques.  For  each  of  the 
last  25  items,  after  respondents  had  selected  an  answer,  and  prior  to 
providing  a  probability,  they  were  required  to  provide  a  reason  supporting 
their  answer,  a  reason  contradicting  it,  or  a  reason  of  each  type.  Details 
may  be  found  in  Koriat,  Lichtenstein,  and  Fischhoff  (1980).  A  no-reasons 
group  responded  to  all  50  items  without  providing  reasons. 

Stimuli. Fifty items were created concerning events that would be
consummated within 30 days of the time of the experiments. Some dealt
with upcoming local elections (e.g., the mayor of Eugene will be (a) Gus
Keller;  (b)  Catherine  Lauris.);  others  dealt  with  sporting  events  (e.g., 
who  will  win  the  following  baseball  game:  (a)  Detroit  Tigers;  (b) 
California  Angels  [home  team]);  others  dealt  with  a  variety  of  topics. 
These  items  were  separated  into  two  sets  so  that  items  dealing  with  topics 
that  seemed  at  all  related  would  not  appear  consecutively  or,  to  the 
extent  possible,  in  the  same  set.  Each  set  was  used  in  the  control 
condition  for  half  of  one  group  and  in  the  experimental  condition  for 
the  other  half. 

Subjects. One hundred and twelve individuals were recruited through an
advertisement in the University of Oregon student paper. They were paid
$7 for completing this task as one part of a 1½-hour session. Subject
groups recruited in this manner typically are about half male and half
female, with an average age of 23. Most are involved with the university
community; about 2/3 are students. They treat the tasks in a diligent
manner, perhaps akin to a proctored exam. We had hoped to have a larger
number of subjects; however, good weather and the proximity of final exams
seem to have kept numbers down. In all, there were 32 people in the
supporting reasons group, 28 in the contradicting reasons group, 26 in the
both-reasons group, and 26 in the group that never gave any reasons.

Results 

Main effect. The items we constructed proved to be fairly difficult for
subjects, with the proportion of correct forecasts over all 3,447 responses
in the control conditions equal to only .618. Associated with these items
was a mean confidence of .722. The usual measure of over- or underconfidence
is the difference between these two statistics. Here it equals +.102,
indicating that subjects' percentage of correct predictions (in the control
condition) should have been higher by 10.2 percentage points if their level
of confidence was to be justified. The calibration curve corresponding to
these responses appears in Figure 4. Respondents' overconfidence is reflected by
the fact that most of the curve falls below the identity line. The
generally positive slope of the curve indicates that subjects tended to
be more knowledgeable when they were more confident. Its flatness,
relative to the identity line, indicates that their knowledge did not rise
as quickly as did their confidence. This curve pertaining to forecasts
looks strikingly like that observed with general knowledge questions of
the same difficulty level (e.g., the bottom curves in Figure 2).
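
The over- or underconfidence measure defined above (mean confidence minus proportion correct) can be sketched in a few lines. The function name and the sample responses below are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
def over_underconfidence(responses):
    """Mean stated confidence minus proportion of correct answers.

    responses: list of (confidence, was_correct) pairs.
    Positive values indicate overconfidence, negative values
    underconfidence, and zero a match between the two statistics.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("at least one response is required")
    mean_confidence = sum(confidence for confidence, _ in responses) / n
    proportion_correct = sum(1 for _, ok in responses if ok) / n
    return mean_confidence - proportion_correct


# Hypothetical responses: mean confidence .75, proportion correct .50.
sample = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
print(round(over_underconfidence(sample), 3))
```

Note that this statistic summarizes a whole set of responses with one number; two sets with the same value can still have differently shaped calibration curves.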

Reasons manipulation. Figure 5 contrasts each group of subjects'
calibration on the experimental condition with their own performance on the
control condition. Thus, it is comparable to Figure 3 from Koriat,
Lichtenstein, and Fischhoff (1980). As a rough guide to the stability of these
curves, in each, there are approximately 100 (±30) responses involved in
determining the proportions correct associated with probabilities of .6,
.7, .8, and .9. If these were all independent responses, that would mean
a standard error of estimate of approximately .05; however, subjects
typically contributed several responses to each point. Approximately 175
(±30) responses were associated with .5 and about 60 (±30) with 1.0.
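
The figure of approximately .05 is consistent with the usual binomial standard error of a proportion, sqrt(p(1 - p)/n). The sketch below assumes that this is the formula behind the estimate, which the text does not state explicitly; the function name is hypothetical.

```python
import math


def standard_error_of_proportion(p, n):
    """Binomial standard error of a proportion p based on n independent responses."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and p in [0, 1]")
    return math.sqrt(p * (1.0 - p) / n)


# With about 100 independent responses, proportions between .5 and .75
# give standard errors between roughly .04 and .05.
for p in (0.5, 0.6, 0.75):
    print(p, round(standard_error_of_proportion(p, 100), 3))
```

Because each subject contributed several responses to each point, the independence assumption is violated, so the value is best read as a rough guide, as the text indicates.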


FIGURE 4.

CALIBRATION OF ALL RESPONSES TO CONTROL ITEMS IN STUDY 1. CURVE INCLUDES
3,447 RESPONSES PRODUCED BY 112 INDIVIDUALS.
(Axis label: Proportion Correct. Panel title: All Control, Study 1.)

The clearest conclusion to be drawn from Figure 5 is that there are few,
if any, systematic differences between the control and experimental
conditions for any group. The supporting reasons group, which showed no
change at all in the Koriat et al. study, seems to have improved somewhat;
however, even these differences seem small relative to statistical
variability. The performance of these groups in the control and experimental
conditions is summarized several ways in Table 2. Here, we find that the
experimental manipulation had little effect on the confidence of
supporting or contradicting reasons subjects (slightly increasing it for the
former, slightly decreasing it for the latter), but it reduced the mean
confidence of both-reasons subjects from .724 to .663. This last change
would have cut the overconfidence that those subjects showed in the
control condition by 2/3 were there not a concomitant drop in their
proportion of correct responses (from .626 to .599). All in all, each of
the three groups was somewhat less overconfident in their respective
experimental conditions. This modest improvement is also reflected in the
group calibration scores shown in Table 2. This score, derived from the
partition of the Brier proper scoring rule (see Lichtenstein, Fischhoff,
& Phillips, in press), reflects the squared distance between the
calibration curve and identity line, weighted by the number of responses at each
point. It decreases as calibration improves, becoming zero with perfect
calibration.
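
The calibration score described above can be sketched as a weighted mean of squared distances. The sketch assumes that each point of the curve is summarized by its stated probability, its proportion correct, and its number of responses, and that the weighted sum is divided by the total number of responses; that normalization is an assumption, since the text says only that the distances are weighted by the number of responses. The function name and sample points are hypothetical.

```python
def calibration_score(points):
    """Weighted mean squared distance between a calibration curve and the identity line.

    points: iterable of (stated_probability, proportion_correct, n_responses).
    Returns 0 for perfect calibration; larger values mean poorer calibration.
    """
    points = list(points)
    total = sum(n for _, _, n in points)
    if total == 0:
        raise ValueError("at least one response is required")
    weighted = sum(n * (prob - prop) ** 2 for prob, prop, n in points)
    return weighted / total


# Hypothetical curve: well calibrated at .5, overconfident at .7 and 1.0.
sample_points = [(0.5, 0.5, 10), (0.7, 0.6, 20), (1.0, 0.8, 10)]
print(round(calibration_score(sample_points), 4))
```

Squaring the distances means that the score does not distinguish overconfidence from underconfidence, so it complements the signed difference between mean confidence and proportion correct.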

In  Figure  5,  each  group's  performance  on  the  reasons  task  was  compared  to 
their  own  performance  on  the  control  (no  reasons)  task.  Although  such 
within-subject  comparisons  allow  greater  sensitivity  of  analysis,  they 
greatly reduce the number of responses involved in each comparison. If
one pools all responses to control questions (as in Figure 4), there
appears to be slight improvement in each experimental condition,
particularly with the both and contradicting reasons group.
TABLE 2

SUMMARY STATISTICS FOR STUDY 1

                              Control                             Experiment
                       Prop.  Mean   Over-                 Prop.  Mean   Over-
Group            N     n      Cor.   Conf.  conf.  Calib.  n      Cor.   Conf.  conf.  Calib.

No Reasons       26    1299   .625   .724   .099   .0227   -      -      -      -      -
Supporting       32     800   .608   .726   .118   .0264   798    .639   .735   .096   .0166
Contradicting    28     698   .607   .711   .104   .0239   685    .616   .702   .086   .0211
Both             26     650   .626   .724   .098   .0225   643    .599   .663   .064   .0123

All             112    3447   .618   .722   .104   .0271   2126   .619   .703   .083   .0160


The preceding analyses assume that the experimental manipulations were
uniformly effective. Koriat et al. discovered a moderate percentage of
items for which reasons were either missing or inappropriate. This was
particularly true with the contradicting reasons group, who often gave
supporting reasons. Each of the present groups omitted reasons for
approximately 10% of all items. When supporting reasons subjects gave
reasons, they were almost always appropriate to the task (99% of the time).
On the other hand, 11% of the contradicting reasons subjects' reasons were
inappropriate, constituting either supporting reasons or vague statements
such as "Maybe I'm wrong." For both-reasons subjects, 5% of their
supporting reasons were inappropriate, compared with 9% of their
contradicting reasons. As in Koriat et al., providing contradicting reasons appears
to be a difficult or unnatural task. The total number of these missing and
inappropriate responses was not large enough that their elimination changes
the calibration curves of Figure 5 appreciably.

Distribution of responses. As mentioned earlier, on the control questions,
subjects made roughly equal use of the responses .6, .7, .8, and .9;
they used .5 somewhat more, 1.0 somewhat less. Distributions for the
experimental conditions were quite similar except for a slight increase in
.5's and decrease in 1.0's. This tendency was particularly marked in the
both-reasons group, whose proportion of .5's increased from .234 to .364
and whose proportion of 1.0's dropped from .112 to .048, thus accounting
for its reduced overall confidence.

TABLE 3

USAGE AND NON-USAGE OF 1.0
(PERCENTAGE OF USERS)

                    Study 1          Study 2          Study 3            All
Group            Control  Exp.    Control  Exp.    Control  Exp.    Control  Exp.

No Reasons        80.8     -        -       -        -       -       80.8     -
Supporting        75.0    65.6     73.1    63.5     81.3    74.7     77.7    69.
Contradicting     71.4    50.0     67.4    44.2     76.3    60.5     71.6    51.
Both              76.9    50.0     75.0    50.0     79.5    61.4     77.1    54.

All               76.8    55.8     72.0    52.4     79.8    68.2     76.2    60

Table 3 shows another aspect of response usage, the percentage of subjects
who expressed confidence of 1.0 in at least one of their 30 forecasts. In
previous studies with general knowledge questions, typically all or almost
all subjects have used 1.0. The fact that only 76% of all subjects did so
on the control task suggests some tendency not to express extreme certainty
in forecasts. This tendency was highlighted in the experimental tasks,
where even fewer subjects used 1.0, particularly for the contradicting and
both-reasons groups. The responses of all subjects who did and did not use
1.0 were pooled separately over all experimental groups. For the control
conditions, non-users were appreciably better calibrated all along the
calibration curve (not shown). Subjects who never expressed extreme
confidence were not only less confident, but also more in tune with the extent
of their knowledge. With the reasons conditions, the same change was
observed, but its size was smaller. Users and non-users of 1.0 had highly
similar percentages of correct responses; hence, differences in calibration
cannot be attributed to differences in difficulty level. As can be seen
from the remainder of Table 3, similar patterns were observed in the
following studies. Calibration curves for these studies will be reported
and discussed later.
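
The separate pooling of users and non-users of 1.0 can be sketched as follows. The sketch assumes that responses are stored per subject as pairs of stated probability and correctness; the function names, subject labels, and sample data are hypothetical and are not taken from the study.

```python
def split_by_use_of_certainty(responses_by_subject):
    """Pool responses separately for subjects who did and did not ever respond 1.0.

    responses_by_subject: {subject_id: [(confidence, was_correct), ...]}
    Returns (user_responses, non_user_responses) as two flat lists.
    """
    users, non_users = [], []
    for subject_responses in responses_by_subject.values():
        used_certainty = any(conf == 1.0 for conf, _ in subject_responses)
        target = users if used_certainty else non_users
        target.extend(subject_responses)
    return users, non_users


def percentage_of_users(responses_by_subject):
    """Percentage of subjects who responded 1.0 at least once."""
    if not responses_by_subject:
        raise ValueError("at least one subject is required")
    n_users = sum(
        1 for subject_responses in responses_by_subject.values()
        if any(conf == 1.0 for conf, _ in subject_responses)
    )
    return 100.0 * n_users / len(responses_by_subject)


# Hypothetical data for three subjects, two of whom used 1.0.
sample = {
    "s1": [(1.0, True), (0.6, False)],
    "s2": [(0.7, True), (0.8, True)],
    "s3": [(0.9, False), (1.0, False)],
}
users, non_users = split_by_use_of_certainty(sample)
print(len(users), len(non_users), round(percentage_of_users(sample), 1))
```

Each pooled list can then be summarized in the same way as any other set of responses, for example by its proportion correct and mean confidence, so that the two groups can be compared at a similar difficulty level.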

Discussion 


Lost  in  this  morass  of  mild  and  inconclusive  effects  is  the  striking  main 
effect  shown  in  Figure  4.  Calibration  for  confidence  in  forecasts  looks 
just  like  calibration  for  confidence  in  general  knowledge,  when  difficulty 
level  is  controlled.  These  forecasters'  accuracy  increased  with  their 
confidence;  however,  it  did  not  increase  as  fast.  As  confidence  rose  from 
.5  to  1.0,  the  corresponding  proportion  of  correct  predictions  only 
increased  from  .5  to  .75.  Respondents  also  tended  to  be  overconfident  in 
the  extent  of  their  knowledge,  getting  62%  of  their  predictions  right,  but 
having  a  mean  probability  of  .72.  The  one  difference  that  does  emerge  is 
a modest reduction in usage of 1.0. Never using 1.0 proved to be a promising
predictor of individual differences in calibration. Despite this overall
similarity, confidence in forecasts did not show the same responsiveness
to the reasons manipulations observed in Koriat et al. There was some
suggestion of improved calibration with the supporting and contradicting
groups. However, the relatively small sample rendered these results somewhat
ambiguous. Before reaching any firm conclusion, it seemed appropriate to
increase the sample size.
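The two summary measures used in this discussion, the calibration curve and overconfidence (mean stated probability minus proportion correct), can be sketched as follows. The function names and the (probability, correct) data layout are assumptions made for illustration, and the responses are invented to echo the pattern described above rather than drawn from the study.

```python
from collections import defaultdict


def calibration_curve(responses):
    """Return {stated probability: (count, proportion correct)}."""
    tallies = defaultdict(lambda: [0, 0])
    for probability, correct in responses:
        level = round(probability, 2)  # guard against float noise in keys
        tallies[level][0] += 1
        tallies[level][1] += 1 if correct else 0
    return {level: (n, hits / n) for level, (n, hits) in sorted(tallies.items())}


def overconfidence(responses):
    """Mean stated probability minus overall proportion correct."""
    responses = list(responses)
    mean_prob = sum(p for p, _ in responses) / len(responses)
    prop_correct = sum(1 for _, c in responses if c) / len(responses)
    return mean_prob - prop_correct


# Invented responses: .5 is right half the time, 1.0 only three times in four.
responses = [(0.5, True), (0.5, False),
             (1.0, True), (1.0, True), (1.0, True), (1.0, False)]
curve = calibration_curve(responses)
```

A positive value from `overconfidence` indicates that stated probabilities ran ahead of accuracy, as they did for these forecasters.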


Because  the  events  had  already  occurred  by  the  time  these  analyses  were 
completed,  it  was  not  possible  to  add  subjects  to  the  existing  groups. 
Instead,  a  second  study  was  run,  replicating  the  first,  but  with  a  new  set 
of  events. 

Study  2. 

Method 


The  design  of  Study  2  followed  that  of  Study  1  except  for  the  elimination 
of  the  no-reasons  group  (which  completed  50  forecasts  without  giving  any 
reasons)  and  an  increase  in  the  number  of  forecasts  from  50  to  60.  As  the 
study  was  completed  in  late  October,  1980,  a  number  of  the  forecasts 
considered  the  elections  of  the  following  month.  A  total  of  143  subjects, 
recruited as in Study 1, participated. This number, too, was somewhat less
than  we  had  hoped  for,  but  did  allow  for  groups  roughly  2/3  larger  than  in 
Study  1. 

Main  effect.  The  difficulty  of  the  present  items  in  the  control  tasks 
proved  to  be  remarkably  similar  to  that  of  Study  1  (62.9%  correct  vs. 
61.8%),  as  was  subjects'  mean  confidence  (.732  vs.  .722).  Subjects'  over- 
confidence  was  correspondingly  almost  identical  (.103  vs.  .102). 

Reasons  manipulation.  Table  4  summarizes  results  for  the  control  and 
experimental  conditions  of  each  of  the  three  groups.  Briefly,  the  only 
apparent  effect  on  overconfidence  was  the  improvement  of  the  both-reasons 
group.  The  other  two  groups  were  essentially  unchanged.  The  calibration 
curves  for  these  groups  were  so  similar  to  those  from  Study  1  (Figure  5), 
that they will not be shown. There were, again, quite a few missing and
inappropriate  reasons,  particularly  for  contradictory  reasons.  Elimination 
of  these  responses  does  not,  however,  appreciably  change  the  patterns  shown 
in  Table  4. 
Distributions of responses. Presumably reflecting the increased sample
size,  the  distributions  of  the  three  groups'  probability  assessments  on 
the  control  tasks  were  quite  similar.  In  the  reasons  conditions,  usage  of 
.5  tended  to  increase  for  all  groups,  whereas  usage  of  1.0  decreased  some- 
what.  As  in  Study  1,  a  substantial  group  of  subjects  never  used  1.0  (see 
Table  3).  Some  28%  followed  this  pattern  in  the  control  condition  and 
46.9%  in  the  reasons  conditions.  This  increase  was  much  greater  for  the 
contradicting  and  both-reasons  groups,  over  half  of  whose  subjects  never 
used  1.0  in  the  experimental  conditions.  The  calibration  curves  for  all 
subjects  who  did  not  use  1.0  showed  them  to  be  less  overconfident  and 
generally  better  calibrated  than  the  remaining  subjects,  both  for  reasons 
and  controls.  As  in  Study  1,  the  task  was  equally  difficult  for  users  and 
non-users. 

Discussion 


The major results of Study 1 have been replicated: Calibration curves for
confidence in forecasts resemble those for confidence in general knowledge
questions. The reasons manipulations had at best weak effects on
overall  calibration.  The  contradicting  and  both-reasons  manipulations  did, 
however,  again  reduce  usage  of  1.0.  In  general,  subjects  who  never  used 
1.0  were  better  calibrated  than  their  counterparts. 

The  overall  similarity  of  the  present  confidence  judgments  to  those 
observed  elsewhere  is  encouraging  for  anyone  who  would  like  to  exploit  that 
literature  for  the  elicitation  and  interpretation  of  forecasts.  For 
example,  we  would  expect  the  training  techniques  that  have  proven  effective 
or  ineffective  with  general  knowledge  items  to  have  similar  effects  on  the 
calibration  of  forecasters.  The  difference  observed  here  between  users  and 
non-users  of  1.0  may  offer  an  additional  tool  for  determining  how  much  faith 
to  place  in  others'  confidence  assessments.  The  weakness  of  the  reasons 
manipulations  is,  however,  disappointing,  because  it  suggests  that  a  simple 
mechanism that has proven effective in improving calibration is not as
robust  as  one  would  hope.  Before  writing  off  this  procedure  and  discussing 
some  possible  implications  of  this  research  for  forecasting,  we  will  offer 
one  further  replication  designed  to  strengthen  the  reasons  manipulation. 

Study  3. 

Method 


Although  this  study  was  essentially  a  replication  of  the  previous  two,  a 
number of changes were introduced in order to strengthen the reasons
manipulation: (a) the number of items per page was reduced from 4 to 3 in order
to present a less cramped format; (b) subjects were asked to produce not one,
but two reasons of the type required by each condition; (c) the instructions
were  changed  to  emphasize  that  the  task  involved  making  predictions  about 
future  events,  and  that  descriptions  of  things  heard  or  read,  beliefs  and 
associations  could  all  be  used  as  reasons  for  the  predictions  made;  (d) 
subjects  were  asked  to  make  a  special  effort  to  be  as  complete  in  describing 
their  reasons  as  possible;  (e)  subjects  were  assured  that  sufficient  time 
had  been  allotted  in  the  experiment  for  them  to  devote  thought  to  the  task. 
All  stimuli  dealt  with  events  whose  outcome  would  be  known  during  the  first 
week of June, 1981. Technical aspects of subject recruitment caused
responses to be elicited on two separate dates, May 15 and May 29, two weeks
before the event period and immediately before it. On May 15, half of the
participants were in each of the supporting and contradicting reasons groups.
Comparisons  between  the  corresponding  supporting  groups  at  the  two  times 
will  reveal  whether  proximity  to  events  has  any  effect  on  calibration  beyond 
its effect on difficulty. One hundred and seventy-three individuals
participated,  with  roughly  equal  numbers  on  the  two  dates. 
Results


Timing.  The  proportion  of  correct  responses  to  control  questions  was  higher 
by  .03  for  the  supporting  group  from  May  29  than  for  the  May  15  supporting 
group,  perhaps  due  to  the  former's  closer  proximity  to  the  events  in 
question. The May 29 group's confidence (and overconfidence) was correspondingly
higher, leaving their calibration curves quite similar.
Corresponding  changes  were  seen  in  the  two  groups'  responses  to  the 
experimental  condition  items,  except  that  the  May  29  group  was  a  bit  less 
overconfident  (.074  vs.  .101).  As  there  is  no  apparent  reason  for  this 
anomaly,  the  two  groups'  data  from  the  two  dates  will  be  pooled  in  the 
following  analyses. 

Main  effect.  Table  5  shows  the  same  patterns  in  responses  to  the  control 
questions as were seen in Studies 1 and 2. Each group is somewhat overconfident.
The poor calibration statistic for the contradicting reasons
group,  despite  its  relatively  low  overconfidence,  reflects  a  very  flat 
calibration  curve,  with  only  a  .12  difference  between  the  proportions 
correct  associated  with  responses  of  .5  and  1.0. 
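To see how a flat curve can produce a poor calibration statistic even when overconfidence is modest, consider the sketch below. It assumes the calibration statistic is the usual response-weighted mean squared difference between each stated probability and the proportion correct at that probability; this section does not spell out the formula, so that definition, the function names, and the invented data are all assumptions.

```python
def calibration_score(responses):
    """Weighted mean squared gap between stated probability and hit rate.

    Assumed definition: sum over levels of n * (level - proportion correct)^2,
    divided by the total number of responses. Lower is better; 0 is perfect.
    """
    by_level = {}
    for probability, correct in responses:
        n, hits = by_level.get(probability, (0, 0))
        by_level[probability] = (n + 1, hits + (1 if correct else 0))
    total = sum(n for n, _ in by_level.values())
    return sum(n * (level - hits / n) ** 2
               for level, (n, hits) in by_level.items()) / total


def make_responses(level, n, hits):
    """Build n invented responses at one probability level, `hits` correct."""
    return [(level, True)] * hits + [(level, False)] * (n - hits)


# Flat curve: proportion correct rises only from .62 to .74 (a .12 difference).
flat = make_responses(0.5, 50, 31) + make_responses(1.0, 50, 37)
# Steeper curve with the same overall accuracy (.68) and mean probability (.75).
steep = make_responses(0.5, 50, 25) + make_responses(1.0, 50, 43)

flat_score = calibration_score(flat)
steep_score = calibration_score(steep)
```

Both invented data sets have the same overconfidence (.07), yet the flat curve scores roughly four times worse, because the statistic penalizes every probability level at which accuracy departs from the stated probability.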

Reasons  manipulation.  As  indicated  by  Table  5,  the  reasons  manipulations 
slightly  reduced  overconfidence  and  slightly  improved  calibration  for  all 
three  groups.  As  shown  in  Table  3,  they  also  reduced  the  usage  of  1.0. 

All these effects were somewhat larger for the contradicting and both-reasons
groups. As before, non-users of 1.0 were considerably better
calibrated  than  users. 


DISCUSSION 

Three  clear  patterns  have  emerged  from  these  three  studies,  each  with  some 
possible  implications  for  forecasting  practitioners: 
TABLE 5

SUMMARY OF STATISTICS - STUDY 3

                                Control                              Experiment
                           Prop.   Mean  Over-                  Prop.   Mean  Over-
Group            N      n   Cor.  Conf.  conf.  Calib.       n   Cor.  Conf.  conf.  Calib.
Supporting      91   2625   .650   .746   .096   .0198    2610   .657   .745   .088   .0151
Contradicting   38   1098   .655   .724   .069   .0275    1086   .656   .706   .051   .0231
Both            44   1264   .654   .737   .083   .0186    1258   .652   .723   .071   .0113
All            164   4987   .652   .739   .087   .0201    4954   .655   .731   .076   .0172

1. Calibration for confidence assessments regarding forecasts is
largely  indistinguishable  from  that  pertaining  to  general  knowledge 
questions.  The  overconfidence  scores  and  calibration  curves  observed  with 
the  control  items  here  were  very  similar  to  those  observed  with  general 
knowledge  items  of  similar  difficulty.  On  the  basis  of  these  results,  one 
should  have  considerably  increased  confidence  in  extrapolating  the  results 
of  earlier  calibration  research  to  confidence  in  forecasts.  Thus,  one 
might  expect  calibration  for  forecasts  to  be  relatively  unaffected  by 
changes  in  response  mode,  incentive  payments  for  correct  answers,  or 
familiarity with subject matter (unless accompanied by a change in difficulty),
to generalize a few results from Table 1.

2.  The  only  apparent  difference  between  these  responses  and  those 
observed  previously  was  the  appearance  of  a  subsample  of  subjects  who  never 
used  1.0.  Over  all  three  studies,  such  subjects  constituted  23.8%  of  the 
control  groups  and  39.8%  of  the  experimental  groups.  As  shown  in  Table  6, 
non-users  of  1.0  were  consistently  much  better  calibrated  than  users,  in 
all  three  studies,  for  both  control  and  experimental  items.  Figure  6  pools 
responses  of  users  and  non-users  across  the  three  studies.  Each  curve 
includes 5,000-15,000 responses made by 100 to 300 subjects. Although non-users
are somewhat better calibrated for most probability values, the major
difference  between  the  groups  is  at  1.0.  Non-users  simply  do  not  produce 
the  point  that  represents  the  greatest  overconfidence,  that  is,  the  greatest 
discrepancy  between  how  often  one  should  be  correct  and  how  often  one  is. 

On  the  basis  of  these  results,  one  might  tentatively  extend  greater  credence 
to  the  confidence  assessments  of  forecasters  who  never  express  complete 
certitude. 

TABLE 6

SUMMARY STATISTICS FOR USERS AND NON-USERS OF 1.0
GROUP MEANS

                                Study 1         Study 2         Study 3           All
                            Control    Exp. Control    Exp. Control    Exp. Control    Exp.
Prop. Correct    Users         .617    .625    .635    .625    .656    .653    .638    .639
                 Non-users     .620    .613    .614    .629    .636    .660    .623    .636
Mean Prob.       Users         .736    .736    .751    .757    .750    .748    .747    .748
                 Non-users     .675    .660    .686    .673    .685    .685    .683    .674
Overconfidence   Users         .119    .111    .115    .132    .094    .095    .108    .109
                 Non-users     .055    .047    .072    .047    .049    .025    .060    .038
Calibration      Users        .0249   .0212   .0197   .0257   .0218   .0182   .0215   .0208
                 Non-users    .0065   .0081   .0065   .0065   .0121   .0079   .0078   .0068

3. The reasons manipulations had consistent but weak effects. In each
study, responses in the experimental condition were better calibrated and
less overconfident than those in the corresponding control conditions. Over
all three studies, overconfidence decreased by .008 for supporting reasons
subjects  (from  .101  to  .093),  by  .007  for  contradicting  reasons  subjects 
(from  .086  to  .079),  and  by  .032  for  both-reasons  subjects  (from  .101  to 
.069). In an applied situation, one might wonder if such modest improvements
were worth the additional time and effort the provision of reasons
requires.  Of  course,  one  might  also  feel  that  the  provision  of  explicit 
reasons  has  desirable  features  independent  of  its  impact  on  calibration. 

These  could  include  (a)  providing  a  record  of  the  reasons  motivating  one's 
forecasts  in  order  to  avoid  the  prejudicial  effects  of  hindsight  bias  when 
the  time  comes  to  evaluate  them,  once  the  event  has  or  has  not  happened 
(Fischhoff,  1975);  (b)  allowing  for  external  review  of  one's  reasoning, 
perhaps leading to the correction of misconceptions or improved communications
(Hogarth & Makridakis, 1981); or (c) helping raise one's alertness to
new  evidence  that  should  prompt  revisions  of  a  forecast  (Armstrong,  1978). 

It  is  worth  noting  in  this  context  that  the  most  dramatic  effect  demonstrated 
by  Koriat,  Lichtenstein  and  Fischhoff  (1980)  was  found  with  a  much  more 
involved  procedure  than  that  depicted  in  Figure  3  and  repeated  in  the  present 
studies.  In  a  separate  experiment,  they  required  subjects  to  complete  a 
2x2  matrix  giving  reasons  for  and  against  each  of  the  two  possible  answers. 
Ten,  rather  than  thirty,  items  were  used  in  that  study.  The  more  ambitious 
and  focused  manipulation  reduced  confidence  by  .023,  while  increasing  the 
percentage  of  correct  answers  by  .040,  thereby  reducing  overconfidence  by 
.063.  Perhaps  one  must  conclude  that  provision  of  one  or  two  reasons  for 
each of a fairly large number of items cannot hurt, but it cannot be counted
on to help very much.

The  most  consistent  effect  of  the  reasons  manipulations,  in  particular  the 
provision  of  contradicting  or  both  reasons,  was  to  increase  the  proportion 
of  subjects  who  never  used  1.0.  As  mentioned,  these  non-users  were  better 
calibrated  than  users  in  both  the  control  and  experimental  conditions. 

Indeed, one might speculate that the primary effect of the reasons manipulations
is to indirectly convince some people never to be entirely certain.
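The size of the reasons effects discussed in point 3 can be checked directly from the figures quoted there. The sketch below is only an arithmetic illustration; the variable names are hypothetical, and the values are the pooled overconfidence scores and the Koriat et al. (1980) changes as reported in the text.

```python
# Pooled overconfidence (control, experimental) as reported in the text above.
pooled = {
    "supporting": (0.101, 0.093),
    "contradicting": (0.086, 0.079),
    "both": (0.101, 0.069),
}
reductions = {group: round(control - experimental, 3)
              for group, (control, experimental) in pooled.items()}

# Koriat et al. (1980) 2x2 procedure: confidence fell by .023 while the
# proportion correct rose by .040, so overconfidence fell by their sum.
koriat_reduction = round(0.023 + 0.040, 3)
```

On these figures, even the largest reduction observed here (.032, for the both-reasons subjects) is about half the .063 obtained with the more involved 2x2 procedure.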
FOOTNOTES


1.  An  alternative  use  of  "calibration"  found  in  the  forecasting 
literature is "to estimate the relationships (and constant terms) in a
forecasting model" (Armstrong, 1978, p. 477). In addition, several other
terms are at times used to describe the calibration of probability assessments
(see Lichtenstein, Fischhoff & Phillips, in press).